
evidence(distill): Phase 3 F-DISTILL-SMOKE-001 PASS on gx10 GB10 #1828

Merged
noahgift merged 5 commits into main from evidence/distill-phase-3-victory
May 20, 2026
@noahgift

🎉 F-DISTILL-SMOKE-001 DISCHARGED

Real distillation 1.5B Qwen2.5-Coder teacher → 0.5B Qwen2.5-Coder student on Blackwell GB10 (sm_121):

initial_loss = 7.6746
final_loss   = 7.2036   ← LESS THAN initial
62 steps, 122.7s, no errors

Phase 3 of SPEC-DISTILL-001 is COMPLETE.
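
The pass condition for this smoke test is that the final loss is strictly lower than the initial loss. A minimal sketch of that check, using the numbers reported above (the function name `smoke_gate_passes` is hypothetical and not taken from the repository):

```python
def smoke_gate_passes(initial_loss: float, final_loss: float) -> bool:
    """Return True when training reduced the loss (final_loss < initial_loss)."""
    return final_loss < initial_loss


# Values reported for the Phase 3 run.
initial_loss, final_loss = 7.6746, 7.2036
delta = final_loss - initial_loss

print(f"delta = {delta:.4f}, pass = {smoke_gate_passes(initial_loss, final_loss)}")
```

The comparison is strict, so a run whose loss stays flat would not discharge the gate.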

What this proves

End-to-end on Blackwell with the full cascade:

  • ✅ Teacher load (1.5B Qwen → 28 transformer blocks)
  • ✅ Student load (0.5B Qwen → 24 transformer blocks)
  • ✅ Forward pass (cuBLAS + pre-warmed PTX)
  • ✅ KD loss computation (kd_step + DistillationLoss)
  • ✅ Backward pass (no JIT-mid-training stream poisoning)
  • ✅ Optimizer step (gradient accumulation + AdamW)
  • ✅ Multi-step convergence (loss decreasing)
  • ✅ Output checkpoint written (student-trained.apr/model.safetensors)
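
For readers unfamiliar with the KD loss step, here is a minimal sketch of the standard temperature-scaled distillation loss. This PR does not show what `kd_step` or `DistillationLoss` actually compute, so treat this as the textbook formulation under assumed names (`log_softmax`, `kd_loss`) and an assumed temperature, not as the project's implementation:

```python
import math


def log_softmax(logits, temperature):
    """Numerically stable log-softmax of logits divided by a temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]


def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    log_p = log_softmax(teacher_logits, temperature)  # teacher (target)
    log_q = log_softmax(student_logits, temperature)  # student (prediction)
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return temperature ** 2 * kl


# The loss is zero when the student matches the teacher and positive otherwise.
print(kd_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
print(kd_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0]))
```

The T^2 factor is the usual correction that keeps gradient magnitudes comparable across temperatures; whether the project applies it is not stated here.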

Cascade landed

| # | PR | What |
|---|----|------|
| 1 | #1804 | PMAT-700-B cuBLAS prewarm skip |
| 2 | #1808 | PMAT-698e workspace cap (2048) |
| 3 | #1809 | PMAT-698f APR magic in weights loader |
| 4 | #1810 | PMAT-698g non-LoRA backward pre-warm |
| 5 | #1813 | PMAT-698h rms_norm_gamma_reduce pre-warm |
| 6 | #1815 | PMAT-698i FWD-CACHE diagnostic logging |
| 7 | #1817 | PMAT-698j THE root cause: warm! macro key |
| 8 | #1820 | PMAT-698k cache-key alignment (rope fwd, rmsnorm eps) |
| 9 | #1823 | PMAT-698m smoke setup non-degenerate batch |
| 10 | #1824 | post-mortem doc |
| 11 | #1827 | PMAT-698n rmsnorm pre-warm at 1e-6 + 1e-5 |
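
Several of the fixes above concern pre-warming kernels and aligning cache keys. One plausible reading, sketched below as a toy model, is that a kernel compiled for one parameter value (say rmsnorm at eps=1e-6) does not satisfy a lookup for another (eps=1e-5), so every key used during training must be warmed beforehand. The `KernelCache` class and its key layout are hypothetical illustrations, not the project's code:

```python
class KernelCache:
    """Toy kernel cache: compiles on first use of a key, reuses it afterwards."""

    def __init__(self):
        self._compiled = {}
        self.frozen = False  # set to True once training starts

    def get(self, name, **params):
        # The key must contain every parameter baked into the kernel.
        key = (name, tuple(sorted(params.items())))
        if key not in self._compiled:
            if self.frozen:
                # Stands in for a JIT compile happening mid-training.
                raise RuntimeError(f"JIT compile requested mid-training for {key}")
            self._compiled[key] = f"compiled<{name}>"
        return self._compiled[key]


cache = KernelCache()
for eps in (1e-6, 1e-5):  # pre-warm both epsilon variants before training
    cache.get("rmsnorm", eps=eps)
cache.frozen = True

print(cache.get("rmsnorm", eps=1e-5))  # cache hit, no compile needed
```

In this model, a mismatch between the key used at warm-up and the key used at lookup looks exactly like a missing pre-warm, which is why key alignment and pre-warming are listed as separate fixes.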

Test plan

Evidence-only PR; the actual code changes already landed across the 11 PRs above. This PR captures the proof-of-success log + dispatch manifest for posterity.

🤖 Generated with Claude Code

2026-05-20 — real distillation 1.5B teacher → 0.5B student on
Blackwell GB10 with the full PMAT-698e..n + PMAT-700-B cascade active.

  initial_loss = 7.6746
  final_loss   = 7.2036   ← LESS THAN initial
  62 steps, 122.7s, no errors

F-DISTILL-SMOKE-001 ("final_loss < initial_loss") discharged.

Phase 3 of SPEC-DISTILL-001 is COMPLETE.

Evidence:
- evidence/distill-phase-3-real-kd/dispatch.json — dispatch manifest
- evidence/distill-phase-3-real-kd/launch-final-pass.txt — full training log

Run dir on gx10: /home/noah/runs/distill-smoke-20260520-070404/
Trained student checkpoint: student-trained.apr/model.safetensors

Cascade summary (all merged):
- #1804 PMAT-700-B  (cuBLAS prewarm skip)
- #1808 PMAT-698e   (workspace cap)
- #1809 PMAT-698f   (APR magic in weights loader)
- #1810 PMAT-698g   (non-LoRA backward pre-warm)
- #1813 PMAT-698h   (rms_norm_gamma_reduce pre-warm)
- #1815 PMAT-698i   (FWD-CACHE diagnostic logging)
- #1817 PMAT-698j   (THE root cause — warm! macro key)
- #1820 PMAT-698k   (cache-key alignment: rope fwd + rmsnorm eps)
- #1823 PMAT-698m   (smoke setup: non-degenerate batch)
- #1824             (post-mortem doc)
- #1827 PMAT-698n   (rmsnorm pre-warm at both 1e-6 + 1e-5 eps)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge (squash) May 20, 2026 05:09
@noahgift noahgift merged commit 1898eb6 into main May 20, 2026
10 checks passed
@noahgift noahgift deleted the evidence/distill-phase-3-victory branch May 20, 2026 09:05
noahgift added a commit that referenced this pull request May 20, 2026
…ASSES (Phase 4 ladder) (#1845)

2026-05-20 12:34 UTC — first end-to-end Phase 4 dispatch with real corpus
(.bin shards via ShardBatchSource). 0.5B Qwen2.5-Coder teacher → 0.5B
student on Blackwell GB10 (sm_121), 100-step trial.

  initial_loss = 15.6094
  final_loss   =  6.0095   ← Δ = -9.60 (62% reduction)
  124 steps, 232.4s, 1.87 sec/step

This is the first real-corpus Phase 4 dispatch. The synthetic Phase 3
victory (#1828, -0.47 over 62 steps) and the seq_len=256 Stage A smoke
(#1833, -6.80) both predicted Phase 4 readiness; Stage C confirms it
with strictly better convergence on real data (codeparrot Python
tokenized to Qwen vocab, 10 shards / 383 MB).

What this validates:
- ShardBatchSource (PR #1836, PMAT-PHASE4-STAGE-B-1) reads .bin shards
  correctly and produces non-degenerate batches
- Pipeline integration (PR #1839, PMAT-PHASE4-STAGE-B-2) swaps from
  synthetic → real source via with_batch_source() cleanly
- Dispatch script DATASET_DIR knob (PR #1840) end-to-end through gx10
- Full Phase 4 readiness for the 50K-step Stage D run (compute-gated,
  requires user check-in per autonomous-mode rule)
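
To make the shard-reading step concrete, here is a minimal sketch of turning a token shard into fixed-shape batches. The on-disk layout of the real .bin shards is not described in this message, so the format below (a flat array of little-endian uint32 token IDs) is an assumption, as are the names `iter_batches` and `is_degenerate` and the definition of "degenerate" as "every token identical":

```python
import struct
import tempfile
from pathlib import Path


def iter_batches(shard_path, batch_size, seq_len):
    """Yield batches of shape [batch_size][seq_len] from a flat uint32 shard."""
    raw = Path(shard_path).read_bytes()
    n_tokens = len(raw) // 4  # 4 bytes per assumed uint32 token
    tokens = struct.unpack(f"<{n_tokens}I", raw[: n_tokens * 4])
    per_batch = batch_size * seq_len
    # Drop the trailing partial batch so every batch has the same shape.
    for start in range(0, n_tokens - per_batch + 1, per_batch):
        chunk = tokens[start : start + per_batch]
        yield [list(chunk[i * seq_len : (i + 1) * seq_len]) for i in range(batch_size)]


def is_degenerate(batch):
    """Treat a batch as degenerate when every token in it is identical."""
    return len({tok for row in batch for tok in row}) <= 1


# Demo with a tiny synthetic shard of 32 tokens.
with tempfile.TemporaryDirectory() as tmp:
    shard = Path(tmp) / "shard_000.bin"
    shard.write_bytes(struct.pack("<32I", *range(32)))
    batches = list(iter_batches(shard, batch_size=2, seq_len=4))

print(len(batches), batches[0], is_degenerate(batches[0]))
```

A 32-bit token type is assumed because Qwen-family vocabularies exceed the 65,536 values a 16-bit ID can hold.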

Cascade math:
  Stage A:  Δloss = -6.80 over 62 steps  (synthetic, seq=256)
  Stage C:  Δloss = -9.60 over 124 steps (real corpus, seq=256)
  Per-step loss decrease:
    Stage A: -0.110/step
    Stage C: -0.077/step
  Stage A's per-step rate is higher because synthetic data has zero
  variance: every batch is the same identity-mapping task. Real-corpus
  Stage C has higher variance but covers more concepts and ran twice as
  many steps (124 vs 62), so its absolute delta is larger.
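
The figures above can be reproduced with a few lines of arithmetic. This sketch recomputes them from the reported losses, step counts, and wall time; the helper name `summarize` is hypothetical:

```python
def summarize(initial_loss, final_loss, steps, seconds=None):
    """Derive delta, percent change, and per-step rates from a run summary."""
    delta = final_loss - initial_loss
    out = {
        "delta": delta,
        "pct_change": 100.0 * delta / initial_loss,
        "per_step": delta / steps,
    }
    if seconds is not None:
        out["sec_per_step"] = seconds / steps
    return out


# Stage C: values reported for the real-corpus trial.
stage_c = summarize(15.6094, 6.0095, steps=124, seconds=232.4)

# Stage A: only the delta and step count are quoted here.
stage_a_per_step = -6.80 / 62

for name, value in stage_c.items():
    print(f"Stage C {name}: {value:.3f}")
print(f"Stage A per_step: {stage_a_per_step:.3f}")
```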

Phase 4 ladder progress:
  Stage A (#1833)              ✅ MERGED + verified
  Stage B-1 (#1836)            ✅ MERGED
  Stage B-2 (#1839)            ✅ MERGED
  Stage C-prep (#1840)         ✅ MERGED
  Stage B-1.5 tests (#1841)    🟡 in CI
  Stage C trial (THIS evidence) ✅ PASSED 2026-05-20
  Stage D 50K dispatch          ⏳ awaiting user check-in (28h GB10 compute)
  Stage E HumanEval pass@1      ⏳ Phase 5 (turnkey post-Stage-D)
  Stage F publish v2            ⏳ Phase 6 (turnkey post-Stage-E)

Evidence:
- evidence/distill-stage-c-trial/dispatch.json — dispatch manifest
- evidence/distill-stage-c-trial/launch-victory.txt — full training log

Run dir on gx10: /home/noah/runs/distill-smoke-20260520-123259/
Trained checkpoint: student-trained.apr/model.safetensors

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 20, 2026
… 4 RUNNING (#1851)

Captures the live state of the distillation epic as of 2026-05-20:

  Phase 1 — Teacher provider              ✅ MERGED (#1786, #1787)
  Phase 2 — Student fwd/bwd + KD          ✅ MERGED (#1788#1797)
  Phase 3 — E2E smoke on Blackwell GB10   ✅ DISCHARGED (#1828)
  Phase 3b — seq_len=256 scale verify     ✅ DISCHARGED (#1833)
  Phase 4 — 50K training (Stage D)        🟡 RUNNING (PID 196378, gx10)
  Phase 5 — HumanEval pass@1              ⏳ ready (#1847)
  Phase 6 — Publish v2                    ⏳ ready (#1848)

Inserts a new top-of-doc status table that points at:
- The 11-PR Blackwell cascade (post-mortem in blackwell-cascade-postmortem.md)
- Stage C real-corpus dispatch result (15.61 → 6.01 over 124 steps)
- Stage D running with ETA ~22h from 2026-05-20 13:43 UTC
- Phase 5/6 turnkey scripts ready post-D

This captures institutional knowledge for the team and future sessions:
the spec doc reflects what's actually shipped rather than the original
plan from 2026-05-18 when the epic was still scaffolded.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>